# Multimodal Pretraining

| Model | License | Description | Tags | Publisher | Downloads | Likes |
| --- | --- | --- | --- | --- | --- | --- |
| Vit So400m Patch14 Siglip Gap 896.pali Pt | Apache-2.0 | Vision model based on the SigLIP image encoder, employing global average pooling; part of the PaliGemma project. | Text-to-Image, Transformers | timm | 15 | 1 |
| Vit So400m Patch14 Siglip Gap 896.pali2 3b Pt | Apache-2.0 | Vision model based on the SigLIP image encoder, employing global average pooling; part of the PaliGemma2 project. | Text-to-Image, Transformers | timm | 14 | 1 |
| Vit So400m Patch14 Siglip Gap 448.pali Mix | Apache-2.0 | Vision-language model based on the SigLIP image encoder, utilizing global average pooling; suitable for multimodal tasks. | Text-to-Image, Transformers | timm | 15 | 0 |
| Vit Base Patch16 Siglip Gap 224.webli | Apache-2.0 | Vision Transformer based on SigLIP, containing only the image encoder and using a global average pooling strategy. | Image Classification, Transformers | timm | 178 | 1 |
| Vit Large Patch14 Clip 224.datacompxl | Apache-2.0 | Vision Transformer based on the CLIP architecture, designed for image feature extraction and released by the LAION organization. | Image Classification, Transformers | timm | 14 | 0 |
| Convnext Base.clip Laion2b Augreg | Apache-2.0 | ConvNeXt-Base image encoder from the CLIP framework, trained on the LAION-2B dataset; supports image feature extraction. | Image Classification, Transformers | timm | 522 | 0 |
| Convnext Base.clip Laion2b | Apache-2.0 | CLIP image encoder based on the ConvNeXt architecture, trained by LAION; suitable for multimodal vision-language tasks. | Image Classification, Transformers | timm | 297 | 0 |
| Model | License | Description | Tags | Publisher | Downloads | Likes |
| --- | --- | --- | --- | --- | --- | --- |
| Minivla Wrist Vq Libero90 Prismatic | MIT | MiniVLA is a vision-language-action model focused on robotics, supporting multimodal image-text-to-text tasks. | Image-to-Text, Transformers, English | Stanford-ILIAD | 18 | 0 |
| Minivla History2 Vq Libero90 Prismatic | MIT | MiniVLA is a compact yet high-performance vision-language-action model, compatible with the Prismatic VLMs training scripts and suitable for robotics and multimodal tasks. | Image-to-Text, Transformers, English | Stanford-ILIAD | 22 | 1 |
| Minivla Vq Libero90 Prismatic | MIT | MiniVLA is a lightweight vision-language-action model compatible with the Prismatic VLMs training framework, supporting multimodal image-text-to-text tasks. | Image-to-Text, Transformers, English | Stanford-ILIAD | 31 | 0 |
| Cogact Base | MIT | CogACT is a novel vision-language-action (VLA) architecture that combines vision-language models with specialized action modules for robotic manipulation tasks. | Multimodal Fusion, Transformers, English | CogACT | 6,589 | 12 |
| Vit Base Patch16 Plus Clip 240.laion400m E31 | MIT | Dual-use vision-language model trained on the LAION-400M dataset, supporting zero-shot image classification tasks. | Image Classification | timm | 37.23k | 0 |
| Merlin | MIT | Merlin is a 3D vision-language model for computed tomography, pretrained on both structured electronic health records and unstructured radiology reports. | Text-to-Image, English | stanfordmimi | 2,418 | 6 |
| Openvla 7b Prismatic | MIT | OpenVLA 7B is an open-source vision-language-action model in the Prismatic VLMs training-script format, supporting full fine-tuning of its 7.5 billion parameters. | Image-to-Text, Transformers, English | openvla | 156 | 5 |
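
The "Openvla 7b Prismatic" entry is the checkpoint in Prismatic VLMs training-script format and is loaded through that codebase; for quick inference the OpenVLA project also publishes an HF-format sibling, `openvla/openvla-7b`, that runs through `transformers` with remote code. The sketch below follows that published usage; the prompt template and `unnorm_key` are assumptions taken from the project README and should be verified against the model card.

```python
# Sketch of querying OpenVLA for a robot action via the HF-format checkpoint
# openvla/openvla-7b (the Prismatic-format entry above uses the Prismatic codebase).
# The prompt template and unnorm_key follow the project's published example and
# should be treated as assumptions to verify.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda:0")

image = Image.open("third_person_cam.png").convert("RGB")  # hypothetical observation
prompt = "In: What action should the robot take to pick up the cup?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)  # continuous end-effector action vector
```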
| Model | License | Description | Tags | Publisher | Downloads | Likes |
| --- | --- | --- | --- | --- | --- | --- |
| LVM Ckpts | Apache-2.0 | LVM is a visual pretraining model that achieves large-scale visual learning by converting visual data into visual sentences and predicting them autoregressively. | Text-to-Image, Transformers | Emma02 | 247 | 5 |
| Blip2 Opt 6.7b | MIT | BLIP-2 is a vision-language model that combines an image encoder with a large language model for image-to-text generation and visual question answering. | Image-to-Text, Transformers, English | merve | 26 | 2 |
| Blip2 Test | MIT | BLIP-2 built on OPT-2.7b; it performs image-to-text generation by freezing the image encoder and the large language model while training a querying transformer (Q-Former). | Image-to-Text, Transformers, English | advaitadasein | 18 | 0 |
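
The BLIP-2 entries above pair a frozen image encoder and a frozen OPT language model through a trained Q-Former; in `transformers` this is exposed via `Blip2Processor` and `Blip2ForConditionalGeneration`. A minimal captioning/VQA sketch, using the canonical `Salesforce/blip2-opt-2.7b` checkpoint as a stand-in since the exact repo ids of the listed copies are not shown here:

```python
# Minimal BLIP-2 image-to-text sketch (captioning and visual question answering).
# Assumption: Salesforce/blip2-opt-2.7b is used as a stand-in for the listed entries.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

image = Image.open("photo.jpg").convert("RGB")  # hypothetical input image

# Captioning: image only, no text prompt.
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(out, skip_special_tokens=True)[0].strip())

# Visual question answering: condition generation on a question prompt.
inputs = processor(
    images=image, text="Question: what is in the picture? Answer:", return_tensors="pt"
).to("cuda", torch.float16)
out = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(out, skip_special_tokens=True)[0].strip())
```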
| Model | License | Description | Tags | Publisher | Downloads | Likes |
| --- | --- | --- | --- | --- | --- | --- |
| Matcha Base | Apache-2.0 | MatCha is a vision-language model focused on chart understanding and mathematical reasoning, strengthened by jointly modeling charts and language data. | Text-to-Image, Transformers, Multilingual | google | 2,445 | 26 |
| Matcha Plotqa V1 | Apache-2.0 | MatCha fine-tuned on the PlotQA-v1 dataset, specializing in visual question answering over charts, with strong chart derendering and numerical reasoning performance. | Text-to-Image, Transformers, Multilingual | google | 83 | 3 |
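
MatCha is built on the Pix2Struct image-to-text architecture, so chart question answering is plain conditional generation from a rendered chart plus a question. A minimal sketch, assuming the Hugging Face id `google/matcha-plotqa-v1` corresponds to the "Matcha Plotqa V1" entry above:

```python
# Minimal chart question-answering sketch with MatCha (Pix2Struct backbone).
# Assumption: google/matcha-plotqa-v1 is the repo id behind the "Matcha Plotqa V1" entry.
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

processor = Pix2StructProcessor.from_pretrained("google/matcha-plotqa-v1")
model = Pix2StructForConditionalGeneration.from_pretrained("google/matcha-plotqa-v1")

chart = Image.open("chart.png").convert("RGB")  # hypothetical chart image
question = "Which year has the highest value?"

# The question is passed as the text header rendered alongside the chart.
inputs = processor(images=chart, text=question, return_tensors="pt")
predictions = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(predictions[0], skip_special_tokens=True))
```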
| Model | License | Description | Tags | Publisher | Downloads | Likes |
| --- | --- | --- | --- | --- | --- | --- |
| Markuplm Base Finetuned Websrc | | MarkupLM is a multimodal pretrained model for visually rich document understanding and information extraction, combining text with markup-language (HTML/XPath) information. | Multimodal Fusion, Transformers, English | microsoft | 168 | 10 |
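
MarkupLM consumes raw HTML, combining node text with its markup (XPath) structure, so WebSRC-style question answering becomes extractive span prediction over the page. A sketch roughly following the `transformers` documentation, assuming the repo id `microsoft/markuplm-base-finetuned-websrc` matches the entry above:

```python
# Extractive question answering over HTML with MarkupLM.
# Assumption: microsoft/markuplm-base-finetuned-websrc is the repo id for the entry above.
import torch
from transformers import MarkupLMForQuestionAnswering, MarkupLMProcessor

processor = MarkupLMProcessor.from_pretrained("microsoft/markuplm-base-finetuned-websrc")
model = MarkupLMForQuestionAnswering.from_pretrained("microsoft/markuplm-base-finetuned-websrc")

html = "<html><body><h1>Acme Laptop 14</h1><p>Price: $999</p></body></html>"  # toy page
question = "What is the price?"

# The processor parses the HTML into tokens plus XPath features.
encoding = processor(html, questions=question, return_tensors="pt")
with torch.no_grad():
    outputs = model(**encoding)

# Decode the highest-scoring answer span.
start = outputs.start_logits.argmax(-1).item()
end = outputs.end_logits.argmax(-1).item()
answer_ids = encoding["input_ids"][0, start : end + 1]
print(processor.decode(answer_ids).strip())
```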
| Model | License | Description | Tags | Publisher | Downloads | Likes |
| --- | --- | --- | --- | --- | --- | --- |
| Taiyi Roberta 124M D V2 | Apache-2.0 | A specially pretrained English multimodal text encoder based on the RoBERTa-base architecture, trained with 1 million image-text pairs. | Multimodal Fusion, Transformers, English | IDEA-CCNL | 18 | 0 |
| Taiyi Vit 87M D | Apache-2.0 | An English MAP visual encoder based on the ViT-base architecture, specially pretrained on the COCO and Visual Genome datasets. | Image-to-Text, Transformers | IDEA-CCNL | 24 | 0 |
| Vilt B32 Mlm | Apache-2.0 | ViLT is a vision-and-language Transformer pretrained on the GCC+SBU+COCO+VG datasets, focusing on joint understanding of images and text. | Text-to-Image, Transformers | dandelin | 7,761 | 11 |
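
ViLT embeds image patches and word tokens into a single Transformer with no convolutional or region-proposal features, so the MLM checkpoint can fill in a masked word conditioned on the image. A minimal sketch, assuming the repo id `dandelin/vilt-b32-mlm` matches the "Vilt B32 Mlm" entry above:

```python
# Image-conditioned masked language modeling with ViLT.
# Assumption: dandelin/vilt-b32-mlm is the repo id for the "Vilt B32 Mlm" entry above.
import torch
from PIL import Image
from transformers import ViltForMaskedLM, ViltProcessor

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-mlm")
model = ViltForMaskedLM.from_pretrained("dandelin/vilt-b32-mlm")

image = Image.open("cats.jpg").convert("RGB")  # hypothetical input image
text = "a photo of two [MASK] lying on a couch"

encoding = processor(image, text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**encoding)

# Report the top prediction for the masked position(s).
mask_positions = (encoding.input_ids == processor.tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
predicted_ids = outputs.logits[0, mask_positions].argmax(-1)
print(processor.tokenizer.decode(predicted_ids))
```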
| Model | License | Description | Tags | Publisher | Downloads | Likes |
| --- | --- | --- | --- | --- | --- | --- |
| Mengzi Oscar Base Retrieval | Apache-2.0 | A Chinese image-text retrieval model based on the Chinese multimodal pretraining model Mengzi-Oscar, fine-tuned on the COCO-ir dataset. | Text-to-Image, Transformers, Chinese | Langboat | 17 | 3 |